insert transposomes into the metagenomic DNA. The tagmentation is followed by PCR with index primers, which enables amplification and indexing (barcoding) of the sample libraries to allow multiplexing. Library preparation is followed by sequencing, which produces the raw data as FASTQ files. The steps of read quality assessment and processing are, to some extent, similar to those discussed in Chapter 1. The purpose of the quality control is to reduce sequence biases and artifacts by removing sequencing adaptors, trimming low-quality ends of reads, and removing duplicate reads. If the DNA was extracted from a clinical sample, an additional quality control step is required to remove contaminating host DNA or other non-target sequences. If we need to perform between-sample differential diversity analysis, we may also need to draw a random subsample of reads from the original sample to normalize read counts.
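As a minimal illustration of this subsampling step, the following Python sketch draws a fixed number of reads at random from an uncompressed, single-end FASTQ file. The function name and file names are hypothetical, the whole file is held in memory for simplicity, and in practice a dedicated tool (e.g., seqtk) with paired-end-aware handling would normally be used.

import random

def subsample_fastq(in_path, out_path, n_reads, seed=1):
    """Randomly draw n_reads records from an uncompressed, single-end FASTQ
    file (each record spans four consecutive lines) and write them out."""
    random.seed(seed)
    records = []
    with open(in_path) as fh:
        while True:
            block = [fh.readline() for _ in range(4)]
            if not block[0]:            # end of file reached
                break
            records.append(block)
    chosen = random.sample(records, min(n_reads, len(records)))
    with open(out_path, "w") as out:
        for block in chosen:
            out.writelines(block)

# Hypothetical usage: normalize every sample to the same read count
# subsample_fastq("sampleA.qc.fastq", "sampleA.sub.fastq", n_reads=1_000_000)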
After quality control, two strategies can be followed for the metagenomic raw data. The first is to assemble the metagenomes with a de novo genome assembly method, and the second is an assembly-free approach similar to the amplicon-based method. Each of these strategies may address different kinds of questions. The types and algorithms of de novo assembly were discussed in Chapter 3. However, shotgun metagenomics introduces a new step, called metagenomic binning, which aims to separate the assembled sequences by species so that the assembled contigs in a metagenomic sample are assigned to different bins in FASTA files. Ideally, a bin will correspond to only one genome. A genome built with the binning process is called a Metagenome-Assembled Genome (MAG). Binning algorithms perform binning in several ways: some use taxonomic assignment, while others use properties of the contigs such as GC content, nucleotide composition, or abundance. Binning
algorithms use two approaches for assigning contigs to species: supervised machine learn-
ing and unsupervised machine learning. Both approaches use similarity scores to assign
a contig to a bin. Because many microbial species have not yet been sequenced, some reads will not map to any reference genome, so it is good practice not to rely solely on mapping to reference genomes. Binning based on the nucleotide composition of a contig has been found useful in separating genomes into possible species. The nucleotide composition of a contig is the frequency of k-mers in the contig, where k can be any reasonable integer (e.g., 3, 4, 5, …). It has been found that the genomes of different microbial species have different k-mer frequency profiles that may discriminate the genomes into potential taxonomic groups.
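To make the idea concrete, the short Python sketch below computes the k-mer frequency vector of a single contig (tetranucleotides by default). The function name is ours, windows containing ambiguous bases are simply skipped, and reverse-complement k-mers are counted separately here, whereas some tools merge them.

from itertools import product

def kmer_frequencies(sequence, k=4):
    """Return the relative frequency of every possible k-mer (A/C/G/T only)
    in a contig, as a dictionary keyed by k-mer."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:   # windows with N or other ambiguity codes are skipped
            counts[kmer] += 1
            total += 1
    return {km: (c / total if total else 0.0) for km, c in counts.items()}

# Toy example: tetranucleotide frequency (TNF) profile of a short contig
profile = kmer_frequencies("ATGCGCGTATATGCGCATGCATGCGTTA", k=4)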
Machine learning algorithms such as naïve Bayes are used for this taxonomic group assignment. However, features more powerful than sequence composition alone are often required to deal with the complexity of contig sequences.
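As a toy illustration of the supervised route, the sketch below trains scikit-learn's multinomial naïve Bayes classifier on tetranucleotide counts of contigs drawn from labeled reference genomes and then assigns an unlabeled contig. The sequences and labels are invented for illustration only; no specific binning tool is implemented this way.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training contigs from reference genomes with known labels.
train_seqs = ["ATGCGCGTATATGCGCATGC", "TTATTAGCAATTAACGAATT",
              "ATGCGCGCGCATGCGCGCAT", "AATTTTAACGTTATTAATTA"]
train_labels = ["species_A", "species_B", "species_A", "species_B"]

# Represent each contig by its tetranucleotide (4-mer) counts.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(4, 4))
X_train = vectorizer.fit_transform(train_seqs)

# Train a multinomial naive Bayes classifier on the k-mer counts.
clf = MultinomialNB()
clf.fit(X_train, train_labels)

# Assign an unlabeled contig to the most likely taxonomic group.
X_new = vectorizer.transform(["ATGCGCATGCGCGTATATGC"])
print(clf.predict(X_new))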
Unsupervised machine learning tools, in contrast, cluster contigs into bins without requiring prior information.
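A corresponding unsupervised sketch, again on invented toy contigs, represents each contig by its normalized tetranucleotide frequencies and groups the contigs with k-means clustering. Real binners, including those discussed next, also exploit coverage information and estimate the number of bins rather than fixing it in advance.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Toy contig sequences; in practice these would come from the assembly FASTA.
contigs = {"contig_1": "ATGCGCGTATATGCGCATGCATGC",
           "contig_2": "TTATTAGCAATTAACGAATTAATT",
           "contig_3": "ATGCGCGCGCATGCGCGCATATGC"}

# Tetranucleotide count matrix (one row per contig, one column per observed
# 4-mer), row-normalized so contigs of different lengths are comparable.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(4, 4))
counts = vectorizer.fit_transform(list(contigs.values())).toarray().astype(float)
tnf = counts / counts.sum(axis=1, keepdims=True)

# Group contigs into a fixed number of bins (two here, purely illustrative).
bins = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tnf)
for name, b in zip(contigs, bins):
    print(f"{name} -> bin {b}")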
There are several binning programs that use different algorithms. MetaBAT 2 [1] uses an adaptive binning algorithm that, unlike its previous version, does not require manual parameter tuning. Its algorithm combines several elements, including normalized tetranucleotide frequency (TNF) scores, clustering, and steps to recruit smaller contigs. Moreover, its computational efficiency has been improved compared to the previous version. MaxBin [2] uses nucleotide composition and contig abundance information to group metagenomic contigs into different bins, where each bin represents one species. The MaxBin algorithm uses tetranucleotide frequencies and scaffold coverage levels to estimate the